Bollywood Movies & Indian Socio-economy: Trends, Genres & Impact

Data analysis and visualizations prepared by Vineet Rai

This research and analysis focuses on trends in Bollywood movies: their subjects, popularity, and the wins and nominations they earned over the years. The aim is to analyze the trends and patterns of Bollywood movies and their audience.

I will try to identify whether there are any relationships between the subjects of the movies, their nominations and the generations across the 7 decades from 1950 to 2019.

I also try to identify the socio-economic conditions and trends for India as a country. Purely as an experiment, and based only on the available data and visualization methods, I examine whether any pattern or relationship can be inferred between movies and socio-economic parameters.

Key indicators for movies data: Movies, imdb ratings/votes (people’s choice), genres, year_of_release, key crew (director, story, producer)

Key indicators used for socio-economic factors of India: population growth and per capita income over the last 50 years.

Datasets and methods: The key sources of data for this analysis are listed below:

  1. Movies, genres, crew, nominations, year, and related data from: https://github.com/pncnmnp/TIMDB/tree/master/1950-2019 (files: bollywood_crew.csv, bollywood_crew_data.csv, bollywood_full.csv, bollywood_writers_data.csv)
  2. Socio-economic data for India over the last 50 years, from: https://databank.worldbank.org/reports.aspx?source=2&country=IND

Since the sample data made available through the IMDb datasets is not well organized and is text heavy, we will need some cleanup to extract numeric data, along with a few assumptions.

The first step is to split and refine the 'genres' field into more meaningful categories. As the data contains 174 unique genre combinations, let's split the genres into primary and secondary genres. This can be done with split(), using the '|' delimiter provided in the dataset.

We will consider the primary and secondary genres of the movie database, to identify the right selections and patterns.

genre1 -> Primary genre & genre2 -> secondary genre
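The split described above can be sketched with pandas as follows. This is a minimal illustration rather than the exact notebook code: the sample rows are made up, and only the column names 'genres', genre1 and genre2 come from the text.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from bollywood_full.csv.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Action|Drama", "Comedy", "Drama|Romance|Musical"],
})

# Split on the '|' delimiter; each genre lands in its own column.
parts = movies["genres"].str.split("|", expand=True)

# First entry is treated as the primary genre, second as the secondary one.
# Movies with a single genre get a missing value for genre2.
movies["genre1"] = parts[0]
movies["genre2"] = parts[1]

print(movies[["genres", "genre1", "genre2"]])
```

Any genres beyond the second are ignored here, which matches the primary/secondary simplification used in this analysis.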

Let's create a visualization to see the share of each primary genre across all the movies available in the dataset from 1950-2019.
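One possible way to compute the shares behind such a pie chart is shown below. It is a sketch with made-up rows; it assumes a genre1 column already exists, and the plotting call is left as a comment so the snippet runs without a display.

```python
import pandas as pd

# Hypothetical primary genres for four movies.
movies = pd.DataFrame({"genre1": ["Action", "Drama", "Action", "Comedy"]})

# Share of each primary genre as a percentage of all movies.
shares = movies["genre1"].value_counts(normalize=True).mul(100).round(1)
print(shares)

# With matplotlib installed, the same Series could be drawn as a pie chart:
# shares.plot.pie(autopct="%1.1f%%")
```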

Based on this sample data and pie chart, we can see that genres like Action, Drama, Comedy and Romance were the most common choice of filmmakers, and presumably of audiences, during the last 7 decades of Bollywood movies.

Based on the primary genres in the sample movies dataset, we can infer the most popular genres over the last 7 decades. (Note the key assumption: the categorization follows IMDb's assignment of the primary genre. For more insight, we can also refer to the secondary genres, as shown in the pie chart above.)

The overall assumption based on the above two analyses is to consider the popular genres below (counts attached):

Genre      Count
Action      1438
Drama       1401
Comedy       789
Crime        187
Romance      119
Musical       63

Let's focus on these popular genres first for the next exercises.
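The per-year counts behind a genre trend plot could be derived roughly as below. This is a sketch under assumptions: the sample rows are invented, and it assumes the year_of_release and genre1 columns are available on the same dataframe.

```python
import pandas as pd

# Hypothetical rows; 'Horror' stands in for any genre outside the popular list.
movies = pd.DataFrame({
    "year_of_release": [1990, 1990, 1991, 1991, 1991],
    "genre1": ["Action", "Drama", "Action", "Action", "Horror"],
})

popular = ["Action", "Drama", "Comedy", "Crime", "Romance", "Musical"]

# Keep only the popular genres, then count movies per year and genre.
subset = movies[movies["genre1"].isin(popular)]
trend = (
    subset.groupby(["year_of_release", "genre1"])
    .size()
    .unstack(fill_value=0)  # one column per genre, 0 where no movies exist
)
print(trend)
```

Each column of the resulting table is one genre's yearly series, so a single genre or all of them together can be plotted from the same table.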

Looking at the above graph, we can draw some inferences:

  1. Over the last 7 decades, production of action movies has increased overall, although the rate of growth varied.
  2. Within the selected time period (1950-2019), production peaked in the 1990s. From the late 80s to the late 90s, the production of action movies was at its peak.
  3. Action movie production took a hit and slowed down after the 2000s.

Let's look at the trends of the other genres and compare them to find more insights.

The trend of the Musical genre is interesting: there is a steep decline during the 1970s-1990s. The genre then regains popularity in production for a couple of decades before falling again after the mid-2000s.

This decline overlaps with the phase when action movies saw a steep incline.

Having analyzed each of the most popular genres in IMDb's sample dataset above, let's plot all these genres in a single plot for easier comparison:

Looking at the collective graph showing trends of overall movies in the sample dataset, as well as the trend for all the top 6 popular genres, we can infer that:

  1. There has been a consistent 'overall' growth trend in the number of movies produced over the last 7 decades.
  2. The most popular genre in Bollywood movies has consistently been 'Drama'. After all, who doesn't love drama!! :)
  3. Surprisingly, the share of action movies, which grew from the 1950s to the 1990s, started falling after the late 90s. Does that mean the action heroes started getting old, and were replaced by the romantic young generation?? Very likely!! :)
  4. The Comedy and Romance genres show growth, but at a comparatively slower pace. We need to laugh more and be more romantic!! I would like to see a higher growth rate of comedy movies, won't you?? Luckily, after the 2000s, the comedy and romance genres show a steeper rise, which is a good sign!
  5. Musical is one of the most fluctuating genres. There was steady growth during the 1950s-1970s; however, the genre took a big hit during 1970-2000, followed by a steep uptick and then another decline.

Let's do a quick analysis of how people perceive and rate Indian movies according to IMDb ratings: what percentage of movies falls at each imdb_ratings level? Steps:

  1. Round all imdb_ratings in the dataframe to integers in the 0-10 range.
  2. Draw a simple pie chart that aggregates all movies in the sample dataset and shows the share of movies at each rating.

This will give a high-level idea of the 'likability' and 'quality' of all Bollywood movies!! :)
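The two steps above can be sketched as follows. The ratings are made-up sample values, and dropping missing ratings before rounding is an assumption of this sketch, not something stated in the analysis.

```python
import pandas as pd

# Hypothetical ratings, including one missing value.
movies = pd.DataFrame({"imdb_ratings": [6.4, 7.6, 5.2, 8.1, 6.8, None]})

# Step 1: drop missing ratings and round to whole numbers.
# Note: Series.round follows NumPy and rounds exact halves to the nearest even number.
rating_int = movies["imdb_ratings"].dropna().round().astype(int)

# Step 2: percentage of movies at each rounded rating, ordered by rating.
rating_share = rating_int.value_counts(normalize=True).sort_index().mul(100)
print(rating_share)

# With matplotlib installed: rating_share.plot.pie(autopct="%1.0f%%")
```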

Looking at this interesting pie chart, we can infer that almost 90% of Bollywood movies are rated 7 or below. Within that, approximately 75% of the movies are rated between 5 and 7, which we can consider average in terms of liking and quality.

On the bright side, approximately 10% of the movies are rated as 8-9. None are rated 10 as per the IMDB ratings.

A lot needs to be done to keep the Bollywood audience happier, and the good thing is that the work is progressing well! How do I know this? Let's look at the wins and nominations trend as a next step to learn more!

Let's do some more analysis now, looking at the wins and nominations of movies in different genres across the years. Steps:

  1. Clean up and organize the data to separate the wins and nominations.
  2. Format the data columns and types properly.
  3. Create visualizations and find the trend of overall wins and nominations of movies in these popular genres.
  4. Compare the visual analysis and trend.

As the field 'wins_nominations' is a free-text field that contains both the wins and nominations numbers, we need a way to extract those numbers from the text in numerical form. One way I can think of is to:

  1. Split the field using the '&' delimiter. Thankfully, this delimiter is used very consistently throughout the sample dataset.
  2. After creating 2 new columns, replace the text and retain the numerical data.
  3. Convert the new derived fields to a numeric datatype and use them for the data visualization.

As we can see, the new columns still do not hold numerical values for wins and nominations, so more data cleanup is needed. Let's remove the text parts 'win' and 'nomination' from the columns and convert them to numerical fields.
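A compact variant of this cleanup is sketched below. Instead of splitting on '&' and then stripping the text, it uses a regular expression to pull out the number in front of 'win' and 'nomination' directly, so a row that only contains nominations is not mistaken for wins. The sample strings are assumptions about how the 'wins_nominations' text looks, and the wins and nominations column names are hypothetical.

```python
import pandas as pd

# Hypothetical examples of the free-text field, including a missing value.
movies = pd.DataFrame({
    "wins_nominations": [
        "3 wins & 5 nominations",
        "1 win",
        "2 nominations",
        None,
    ]
})

text = movies["wins_nominations"].fillna("")

# Capture the digits that precede 'win' / 'nomination'; no match gives NaN.
wins_text = text.str.extract(r"(\d+)\s*win", expand=False)
noms_text = text.str.extract(r"(\d+)\s*nomination", expand=False)

# Convert to numbers, treating missing entries as zero.
movies["wins"] = pd.to_numeric(wins_text, errors="coerce").fillna(0).astype(int)
movies["nominations"] = pd.to_numeric(noms_text, errors="coerce").fillna(0).astype(int)

print(movies)
```

Treating a missing entry as zero is a simplification; keeping it as a missing value may be preferable if the gaps themselves should be analyzed.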

The above comparative analysis is indeed very interesting. My inferences are summarized in the points below:

  1. Looking at the number of movies in the 'Action' genre, there is overall growth in numbers and popularity, as the graph rises overall from 1950-2020.

  2. Looking at the 'Wins' line, action movies did not record many wins during the initial decades from 1950-2000.

  3. However, there is a huge spike during the later decades, i.e. from 2000-2020.

  4. Action movies recorded a lot of wins even though the overall number of action movies per year decreased.

  5. This suggests that Bollywood filmmakers scaled up the quality of movies rather than focusing on quantity.

  6. People loved these action movies, and they recorded a remarkable number of wins. For example, Dabangg (2010) recorded 50 wins, and movies such as Kick, Gangs of Wasseypur, Sultan and Chennai Express changed the trend of wins for movies in the 'Action' genre.

Let's look at the other popular genres and corresponding wins they secured and the trend:

A trend very similar to the one for Action movies appears for the other 3 genres as well, namely Comedy, Drama and Romance. This may be a good indicator of how the quality of Bollywood movies, the size of the audience and the number of nomination categories have increased over the period. Certainly a positive trend for the quality of movies and viewership!

Let's compare how the different genres did among themselves on wins throughout the time period covered by this sample dataset. We will plot them all together to see the comparative trend (considering only the genres Action, Romance, Drama and Comedy for simplicity and relevance).

Quick inferences looking at the wins vs genres trend as above:

  1. It clearly shows that Bollywood movies in the Drama genre are consistently the top winners.
  2. Action movies were gaining quite a lot of popularity during the 80s and 90s, but declined after 2000.
  3. Comedy movies show an interesting incline in wins after the 2000s.

Time to think - is the new generation liking more drama and comedy? Quite likely! :)

Now, let's look at some social factors of the movie audience in India, assuming that the population's preferences influence the type and quantity of movies made. Let's analyze the growth trajectory of the per capita income of the Indian population over several decades. Maybe this gives us some visual patterns! Source of data: socio-economic data for India over the last 50 years, from: https://databank.worldbank.org/reports.aspx?source=2&country=IND

Let's organize and clean up the data into a more readable format that is also more useful for developing the visualizations. To do this, let's first pivot the years from columns to rows using melt() from pandas.
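The reshaping step might look like the sketch below. The column layout ('Series Name' plus year columns such as '1970 [YR1970]') is an assumption about the World Bank export, and the numbers are placeholders, not real figures.

```python
import pandas as pd

# Hypothetical wide-format export: one row per indicator, one column per year.
# Values are placeholders, not real World Bank data.
wide = pd.DataFrame({
    "Series Name": ["GDP per capita (current US$)", "Population, total"],
    "1970 [YR1970]": [1.0, 10.0],
    "1971 [YR1971]": [2.0, 20.0],
})

# Pivot the year columns into rows: one row per (indicator, year) pair.
long = wide.melt(id_vars="Series Name", var_name="year", value_name="value")

# Keep only the 4-digit year from labels like '1970 [YR1970]'.
long["year"] = long["year"].str.slice(0, 4).astype(int)

# Optional: one column per indicator, indexed by year, which is handy for plotting.
tidy = long.pivot(index="year", columns="Series Name", values="value")
print(tidy)
```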

We can see that per capita income has risen, with an especially steep incline after the 2000s. Let's also look at the 'Population' growth trend over the years 1970-2021.

As expected, the population grew; what is interesting to see here is that the growth is almost perfectly linear!

Let's utilize all the data and analytics we have derived till now to see if there is a visual relation or pattern between:

  1. Overall movies production trend
  2. India's per capita income growth trend
  3. Movies genres trend
  4. India's population growth trend

Let's try to plot 4 graphs side by side to infer the trend and possible correlation based on data visualization only:

Looking at the 4 visualizations across movies and socio-economic data, we can draw some key inferences:

  1. 1950s - 1970s older generation : Drama and Romance were the top genres
  2. 1970s - 2000s middle generation : Drama and Action movies were the top genres
  3. 2000s - 2020 new generation : Drama and Comedy movies are the top two genres by production
  4. There is a steeper rise in India's per capita income after the 2000s, and coincidentally we see a steep rise in the number of movies produced as well as in the nominations won by Indian movies (refer to wins and nominations).
  5. This may indicate that, with higher income to spend, the Indian population found time and money for entertainment through movies, which may explain the rise in both the quality and quantity of movies.

There may be more correlations and deeper connections, which could be derived using statistical analysis and correlation algorithms. However, as the scope of this course and exercise is limited to data visualization methods and analysis, I'll conclude at this point with the above inferences.

Thanks! Hope you liked it!